Simplify checkpointer and make it work for large models #628
Conversation
@ali-ramadhan I will wait until part 1/3 is merged to review this.
Codecov Report
@@ Coverage Diff @@
## master #628 +/- ##
==========================================
- Coverage 75.13% 70.45% -4.68%
==========================================
Files 118 118
Lines 2280 2288 +8
==========================================
- Hits 1713 1612 -101
- Misses 567 676 +109
Continue to review full report at Codecov.
A couple questions / comments:
This PR finally upgrades the checkpointer so it can restore large models that take up more than 50% of system memory. It used to create a model then restore the fields which allocates twice as much memory as needed.
This is only relevant for CPU models --- correct? In other words, when running on the GPU we still need to load data onto the CPU and then transfer to the GPU. Previously, the option was available to create a model on the GPU, load data on to the CPU, and then copy that data to the GPU --- right? Or am I missing something?
edit: for example:
julia> using CuArrays
julia> a = CuArray{Float64}(undef, 10, 10); b = rand(10, 10);
julia> copyto!(a, b)
Does this PR impact the user API for checkpointing at all, or does it just change restore_from_checkpoint, which does everything behind the scenes?
We may want to move the section on checkpointing that is currently in the documentation at Model setup > Output writers to its own section within the documentation.
That is technically true, although in general you tend to have a lot more CPU memory than GPU memory, so I suspect it won't be an issue there. You could reduce memory allocation even further by loading fields from disk into a temporary array, then copying them to CuArrays one field at a time. But I don't think that will work with JLD2, as the entire file is loaded into memory at once. You could do it with chunked NetCDF files, for example, where you read specific chunks into memory one at a time.
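For the NetCDF route, a minimal sketch of what that chunked restore could look like (hypothetical file and field names; assumes NCDatasets.jl for reading, and that `gpu_fields` maps each name to a preallocated CuArray):

```julia
using NCDatasets, CuArrays

# Hypothetical sketch: restore one field at a time, so only a single
# CPU-side buffer is live at any moment instead of the whole file.
ds = NCDataset("checkpoint.nc")           # hypothetical checkpoint file
for name in ("u", "v", "w", "T", "S")     # hypothetical field names
    chunk = Array(ds[name][:, :, :])      # read just this field into CPU memory
    copyto!(gpu_fields[name], chunk)      # host-to-device copy into the existing CuArray
end
close(ds)
```

The peak extra CPU memory here is one field's worth of data rather than the whole checkpoint.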
You could do that but no user would have gone through the trouble.
Just changes what happens behind the scenes. The API is the same but checkpoint files now have file names like
Sounds like a good idea, will do before merging.
Well, fair, I was more commenting that "we" could have done that behind the scenes for the GPU case.
I'm still confused about how this changes what was previously done. Did we previously create arrays on the GPU, copy checkpoint data from the CPU into "temporary" GPU arrays, and then copy from the temporary GPU arrays into the previously-instantiated model data? Or am I missing something (is it possible to load data from disk directly onto the GPU)?
Yeah, this is what we were previously doing, which is kind of wasteful because you create a temporary copy of every field. Now we construct the fields with the restored data and pass them to the model constructor, so there are no temporary arrays and zero unnecessary allocations.
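As a rough sketch of the difference (simplified, with hypothetical constructor signatures, not the actual Oceananigans API):

```julia
using JLD2, FileIO

# Before: build a model (allocating fresh field arrays), then load the
# checkpoint (allocating a second copy) and overwrite the model's data.
model = Model(grid=grid)                 # first allocation of every field
restored = load("checkpoint.jld2")       # second allocation of the same data
model.velocities.u.data .= restored["u"] # copy, then wait for GC to free `restored`

# After: wrap the restored data in fields and hand them to the constructor,
# so each field's data is allocated exactly once.
restored = load("checkpoint.jld2")
u = Field(grid, restored["u"])           # field wraps the restored array directly
model = Model(grid=grid, velocities=(u=u,))
```

This is why the old path needed roughly twice the model's memory during a restore, while the new path does not.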
Hmmm, not sure, but it doesn't seem impossible. At the lowest level it'll have to do some host-to-device copies though, I think. @leios or @vchuravy would know.
This PR finally upgrades the checkpointer so it can restore large models that take up more than 50% of system memory. It used to create a model then restore the fields, which allocates twice as much memory as needed. Now the data needed to restore the fields is passed to the model constructor, so there is no double allocation. Some refactoring had to happen to make this possible.
This PR is also part 2/3 of making boundary conditions a field property.
Should help a lot with #602 and #603.
Resolves #416
Resolves #417
Note: This PR branches off #627.